Knowledge Discovery in Biomedical Literature

A. Beryl Joylin*, Prof. Nancy Victor

School of Information Technology and Engineering, VIT University, Vellore

*Corresponding Author E-mail: beryljoylin@gmail.com, nancyvictor@vit.ac.in

ABSTRACT:

One of the active and challenging research topics in data mining is finding useful knowledge in collections of unstructured data, especially in the biomedical domain, with its complications and intricacies. With the amount of biomedical literature generated every day, the field is becoming information-saturated, and building automated extraction tools to handle the large volumes of published literature is becoming more important. However, making effective use of this enormous and steadily growing amount of data, especially in the biomedical domain, remains a challenge for many researchers. The goal of this project is to build a framework on which researchers can query and generate visualizations of the latest progress in biomedical publications or in their areas of interest. PubMed, the largest repository of medical data, maintained by the United States National Library of Medicine, is used as the source for journal abstracts and other metadata. Named Entity Recognition, specially developed for biomedical literature, is performed on the abstracts to extract entities such as chemicals, drugs, proteins, and genes. The proposed system uses a multinomial logistic regression model combined with rule-based methods to identify biomedical entities in the abstracts. A graph database is built from the identified entities along with metadata from the publication. A graph database is used because relationships between nodes can be represented explicitly. This allows semantic querying to be performed on the database, which makes complex queries easier to express. The graph schema features a tree representation of timelines, which makes time-related queries much more efficient. The graph database used is Neo4j, an open-source graph database implemented in Java. As a result, a system that researchers and pharmaceutical companies can use to identify research trends is developed.

KEYWORDS: PubMed, Named Entity Recognition, Conditional Random Fields, Biomedical, Graph Database.

INTRODUCTION:

Knowledge discovery is the process of finding useful knowledge in a collection of data. Knowledge discovery in an enormous amount of unstructured data is not an easy task. A specific understanding of the domain of the text is required to extract structured data from unstructured data. For machines to perform this kind of concept-based learning, supervised machine learning algorithms are used so that they learn the sentence structure and correctly identify the required information.

Every day a huge amount of biomedical literature is generated, and the field is becoming information-saturated. Hence, building knowledge discovery tools to handle the large volumes of published literature is becoming more important. However, making effective use of this enormous amount of data, especially in the biomedical domain, remains a challenge for many researchers. Patents, scientific articles and medical reports are the main sources of information on chemical entities and diseases. PubMed, the largest repository of medical data, maintained by the United States National Library of Medicine, provides text abstracts of the latest medical research publications.

As a first step of this knowledge discovery task, Named Entity Recognition is performed. Named Entity Recognition is the Natural Language Processing task of identifying words and phrases belonging to certain classes (e.g. genes and proteins). Since Named Entity Recognition requires a specific understanding of the domain of the text, supervised machine learning algorithms are used so that machines learn the sentence structure and identify the named entities in it. The Entrez API provides the PubMed abstracts, along with the publication metadata, in XML format. The metadata and the extracted entities can be put into a graph database for further analysis.

Even though Named Entity Recognition has attracted a considerable amount of research in the newswire domain, this research could not flourish in the biomedical domain because of several complexities. First, the entity names in use number in the millions or more[1], and because new entity names are constantly being added, no dictionary or training data set can cover all the available entities. Second, although the biomedical field is moving swiftly towards a consensus on the name to be used for a given entity[2], or towards a common name for all entities that share exactly the same concept[1], the same or similar names and acronyms are still used for different concepts[2]. This gives rise to significant ambiguities. Third, although naming conventions exist, many authors do not follow them and instead introduce their own abbreviations, which they use from the start to the end of the paper[2]. Finally, for both humans and automated systems it is generally easier to determine whether an entity name is present than to detect its boundaries, and entity names in biomedical text are longer on average than names in other domains[1][3][4].

Graph databases are well suited to representing and querying connected data of significant size and value. Connected data is data whose interpretation and value require us first to understand the ways in which its constituent elements are related. More often than not, generating this understanding requires naming and qualifying the connections between things. Graph databases address one of the major business trends of today: leveraging complex and dynamic relationships in highly connected data to generate insight and competitive advantage.

In this project, the focus is to build a framework on which researchers can query and generate visualizations of the latest progress in biomedical publications or their areas of interest.

LITERATURE SURVEY:

Research on Named Entity Recognition (NER), the Natural Language Processing task of identifying words and phrases belonging to certain classes (e.g. location, people, organization, proteins, genes), started over a decade ago in the newswire domain[5]. Some researchers have performed Named Entity Recognition on Wikipedia articles[5]. Named Entity Recognition in the newswire domain, as well as on Wikipedia text, focused on classifying words and phrases as locations, persons and organizations[5]. The initial studies used the CoNLL (2003) corpora and machine learning algorithms such as Hidden Markov Models (HMM), Conditional Markov Models (CMM) and Conditional Random Fields (CRF) to generate classifier models[5]. As the research progressed, a team of researchers from Stanford University developed a Named Entity Recognition system called the Stanford Named Entity Recognizer (Stanford NER), a Java implementation that uses models trained with a CRF classifier on a mixture of the CoNLL, MUC-6, MUC-7 and ACE named entity corpora[6].

While Named Entity Recognition in the newswire domain has attracted a considerable amount of research over the past 12 years, the biomedical domain has consistently lagged behind. Apart from the many complexities of performing Named Entity Recognition in the biomedical domain, the lack of an extensively annotated corpus of biomedical literature has been a major bottleneck for applying Natural Language Processing techniques there[7]. Further research in this field led to the development of various annotated corpora that support Natural Language Processing and thus the recognition of biomedical entities such as genes, proteins, drugs, chemicals and diseases. Some of the corpora in the literature are discussed below:

GENIA:

This is the first corpus developed to initiate Natural Language Processing (NLP) tasks in the biomedical domain. The GENIA corpus consists of approximately 2,000 MEDLINE abstracts, about 400,000 words, and almost 100,000 annotations for biological terms that indicate genes and proteins. For the development of this annotated corpus, articles with the MeSH terms blood, human and transcription factor were selected. This specific and narrow selection of articles was made so that the annotation work would converge on biological reactions involving transcription factors in human blood cells. The GENIA corpus encodes the articles in an XML format in which each article contains a title, a MEDLINE ID and an abstract. All the abstracts and titles were carefully annotated by two domain experts to label biologically meaningful terms. These terms are annotated semantically using the descriptors available in the GENIA ontology, which contains 47 biologically relevant categories. Overall, the total number of annotations made is 96,582, of which 89,682 are for surface-level terms and 1,583 are for higher-level terms. The total number of terms recovered is 93,293. The corpus was thus developed with the expectation that Natural Language Processing techniques would be applied to recognize biological entities such as genes and proteins in the biomedical literature[7].

CHEMDNER:

The CHEMDNER corpus is a collection of 10,000 PubMed abstracts chosen to represent a variety of chemical compounds. All the abstracts were annotated by trained chemistry domain experts with a background in literature curation. The corpus as a whole was divided randomly into three subsets: the training set (3,500 abstracts), the development set (3,500 abstracts) and the test set (3,000 abstracts). Altogether, the corpus contains 84,355 chemical entity mentions corresponding to 19,806 unique chemical names[8]. It is an annotated corpus that challenges data mining researchers to perform Named Entity Recognition to identify the names of drugs and chemicals in the biomedical literature[9].

NCBI Disease Corpus:

The NCBI disease corpus contains 6,892 disease mentions, which are mapped to 790 unique disease concepts[10]. The corpus comprises 793 PubMed abstracts, fully annotated at both the mention and the concept level, and was developed to serve as a research resource for the biomedical natural language processing community[10]. The 793 abstracts consist of more than 6,000 sentences, more than half of which contain disease names. Two annotators manually annotated each PubMed abstract with disease mentions and their corresponding concepts in Medical Subject Headings (MeSH) or Online Mendelian Inheritance in Man (OMIM)[10].

Various classifiers can be used to generate trained models on the above-mentioned corpora. Biomedical entities in unlabelled biomedical text are then identified using the classifier model generated after training and testing. The task of identifying biomedical entities can be considered either a sparse learning problem or a sequence labelling problem. It involves supervised learning, in which a classifier generates a model that is then used for classification and prediction. Statistical modelling techniques used to generate classifier models for such problems include Hidden Markov Models (HMM)[11], Conditional Random Fields (CRF)[12], Conditional Markov Models (CMM)[13], and Multinomial Logistic Regression models[14]. Many other researchers have combined these techniques to generate customised models[6][15]. So, in addition to the availability of corpora, classifiers play a very important role in determining the named entities present in the text.

Existing System:

Over the past 12 years, Named Entity Recognition has been performed in the newswire domain to identify entities such as locations, people and organizations. However, for lack of an extensively annotated corpus, researchers were unable to identify named entities in the biomedical domain, even as the amount of biomedical data generated in the form of articles, books and publications kept growing. With the recent development of annotated corpora such as the CHEMDNER corpus and the NCBI disease corpus, the present challenge is to identify named entities in the biomedical literature and to develop a system that serves as a platform for analysing biomedical research publications.

Proposed System:

In the biomedical field, a growing amount of data is continuously being produced. This growth is accompanied by a corresponding increase in textual information in the form of articles, journals and books. Researchers need a robust system to find out about the latest advances in the field. Apart from metadata about a research paper, such as the date, authors and references, no information is available on the biomedical entities mentioned in the paper. Better relationships and inferences can be drawn between publications if the entities mentioned in them can also be used for the analysis. The proposed system aims to extract these biomedical entities from research papers and to build a graph database from them along with other metadata. The result is a platform on which companies and researchers can perform specialised analysis.

Each Module in the system is explained below:

1. Corpus Parser:

The corpus-parser module accepts the annotated corpora and parses the data into usable chunks. Its output is text comprising the titles and abstracts of the articles together with their annotations. Because the corpora come in different formats, the format of each corpus is specified manually. In this project, the NCBI disease corpus and the CHEMDNER corpus are the two main corpora used. The NCBI disease corpus comes in text format, so a regular-expression-based parser is used to extract the title, abstract and annotations. The CHEMDNER corpus comes in XML format, so an XML parser is used to extract the title, abstract and annotations. The main parser used is the ElementTree XML API, a Python library for parsing XML.
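
As a rough illustration of the regular-expression-based parsing described above, the sketch below reads one document from a text-format corpus. The line layout it assumes (a `PMID|t|title` line, a `PMID|a|abstract` line, then tab-separated annotation lines giving PMID, start offset, end offset, mention text and entity type) is an assumption made for this example, as are the names `parse_document` and `Annotation` and the sample record; the actual corpus files should be checked before reusing it.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Annotation(NamedTuple):
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    text: str    # mention as it appears in the document
    label: str   # entity type assigned by the annotators


# Assumed layout: "<pmid>|t|<title>" and "<pmid>|a|<abstract>".
TEXT_LINE = re.compile(r"^(\d+)\|([ta])\|(.*)$")
# Assumed layout: "<pmid> TAB <start> TAB <end> TAB <mention> TAB <type> ...".
ANNOTATION_LINE = re.compile(r"^(\d+)\t(\d+)\t(\d+)\t([^\t]+)\t([^\t]+)")


def parse_document(block: str) -> Tuple[Optional[str], str, str, List[Annotation]]:
    """Return (pmid, title, abstract, annotations) for one document block."""
    pmid, title, abstract = None, "", ""
    annotations: List[Annotation] = []
    for line in block.splitlines():
        text_match = TEXT_LINE.match(line)
        if text_match:
            pmid = text_match.group(1)
            if text_match.group(2) == "t":
                title = text_match.group(3)
            else:
                abstract = text_match.group(3)
            continue
        ann_match = ANNOTATION_LINE.match(line)
        if ann_match:
            annotations.append(
                Annotation(
                    start=int(ann_match.group(2)),
                    end=int(ann_match.group(3)),
                    text=ann_match.group(4),
                    label=ann_match.group(5),
                )
            )
    return pmid, title, abstract, annotations


# Invented sample record, used only to exercise the parser.
SAMPLE = (
    "12345|t|Aspirin and asthma\n"
    "12345|a|Aspirin can trigger asthma in some patients.\n"
    "12345\t12\t18\tasthma\tDisease\n"
)
pmid, title, abstract, annotations = parse_document(SAMPLE)
```

Keeping the character offsets alongside the mention text makes it possible to check each annotation against the parsed title or abstract before training.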

2. Article Processing:

The article processor splits the abstract and title text into tokens and generates the feature vectors used by the classifiers. It is made up of several smaller modules, namely the sentence detector, tokenizer, POS tagger, and chunker. Sentence detection involves determining the sentence boundaries. To determine them, a syntactic dependency parse tree is used, in which punctuation and capitalisation play an important role. This means that sentence boundaries will at least coincide with clause boundaries, even if the sentence is poorly punctuated. Tokenization is the task of splitting sentences into meaningful segments called tokens. The tokenizer takes Unicode text as input and returns an object as output. This object combines an instance holding a lookup table that gives access to lexemes, a sequence of word strings, and an optional sequence of spaces stored as Booleans. The sequence of spaces records how the tokens are aligned in the original string. Each token is then tagged with a POS label such as noun, verb, adverb, preposition or adjective. These labels are very useful when generating the classifier models. During POS tagging, rule-based morphological features are also identified for each token. Morphology is the process by which the root form of a word is modified or combined with one or more morphological features to create the surface form of the word. For example, consider the sentence “I am writing a letter”. For the word “writing”, the surface form is “writing”, the root form is “write”, the POS tag is “verb”, and the morphological feature is identified as verbform = Gerund. Other morphological features such as mood and tense are identified similarly, depending on the context of the sentence.

This processing involves a series of steps. First, a mapping table is consulted that allows a sequence of characters to be mapped to multiple tokens; each of these tokens may be assigned a part-of-speech tag and one or more morphological features. Second, each token is assigned an extended POS tag, which expresses the part of speech as well as some morphological information such as tense. Third, for words whose part of speech was not set in the previous steps, another mapping table maps the tags to a part of speech and a set of morphological features. Finally, a rule-based deterministic lemmatizer maps the surface form to a root form, or lemma, in the light of the assigned part of speech and without consulting the token's context.
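
The tokenizer output described above (a sequence of word strings plus a parallel sequence of Booleans recording trailing spaces) can be sketched in a few lines. This is a deliberately simplified stand-in written for illustration, not the tokenizer used in the system: the regular expression and the names `tokenize` and `detokenize` are assumptions, and only single spaces between tokens are preserved.

```python
import re
from typing import List, Tuple

# A token is either a run of word characters, optionally joined by internal
# hyphens (as in "IL-2"), or a single non-space symbol such as punctuation.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text: str) -> Tuple[List[str], List[bool]]:
    """Split text into word strings and record whether a space follows each."""
    words: List[str] = []
    spaces: List[bool] = []
    for match in TOKEN_PATTERN.finditer(text):
        words.append(match.group())
        end = match.end()
        spaces.append(end < len(text) and text[end] == " ")
    return words, spaces


def detokenize(words: List[str], spaces: List[bool]) -> str:
    """Rebuild the original string from the words and the space flags."""
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


sentence = "IL-2 activates T-cells."
words, spaces = tokenize(sentence)
rebuilt = detokenize(words, spaces)
```

Because the space flags are stored next to the words, token positions can always be mapped back to character offsets in the original abstract, which matters when annotations are given as character spans.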

3. Feature Extraction:

During this process, the features required for identifying the entities are defined. The success of a classifier depends heavily on the feature extractor. It combines all the features generated by the article processor, including the POS tag, surface form, root form and other morphological features such as gerund, mood and tense, with orthographic features generated by the parser, such as hyphenation, capitalization, punctuation and spacing, and with other features, to generate a feature vector. During training, the classifier uses this feature vector to generate a classifier model by making statistical probability estimates. spaCy’s machine learning library, Thinc, which uses tested linear models for sparse learning problems, is used for classification. This library draws on powerful machine learning libraries such as scikit-learn, TensorFlow, Keras and gensim. The Multinomial Logistic Regression model, a log-linear model, is the main statistical modelling technique used by this library for classification and prediction.
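
A minimal sketch of how per-token features of the kinds listed above might be assembled is shown below. The feature names, the context window of one token on each side, and the function `token_features` are illustrative assumptions; the root form and morphological features mentioned in the text are omitted because they would come from the article processor.

```python
import string
from typing import Dict, List


def token_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, object]:
    """Build a sparse feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        # Lexical features
        "lower": word.lower(),
        "suffix3": word[-3:].lower(),
        "pos": pos_tags[i],
        # Orthographic features
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "has_hyphen": "-" in word,
        "is_punct": all(ch in string.punctuation for ch in word),
        # Context features from the neighbouring tokens
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_lower": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
    }


tokens = ["IL-2", "activates", "T-cells", "."]
pos_tags = ["NOUN", "VERB", "NOUN", "PUNCT"]
first = token_features(tokens, pos_tags, 0)
last = token_features(tokens, pos_tags, 3)
```

Dictionaries like these are naturally sparse, which fits the linear models for sparse learning problems mentioned above.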

4. Training and Testing:

Both the training corpus and the testing corpus are parsed by the corpus parser and processed by the article processor, and a feature vector is generated. This is the phase in which the actual machine learning happens. During the training phase, the generated feature vector and the parsed training dataset are given to the classifier to generate a trained model. The annotations are specified to the trainer as token tags. Sometimes a character offset and a token boundary do not match. In such cases, the annotation is treated as a missing value. The entity recognizer can thus learn from examples that occasionally contain tokenizer errors. A tagging scheme is used to define the boundaries of entities. The model used is spaCy’s greedy transition-based parser, guided by a linear model generated using the Multinomial Logistic Regression classifier of scikit-learn. The weights of this model are learned using the averaged-perceptron loss with a dynamic oracle training strategy. During the testing phase, the same process as in the training phase is applied to label the entities in the pre-annotated testing dataset and so evaluate the performance of the generated model.
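
The handling of annotations whose character offsets do not line up with token boundaries can be sketched as follows. The example assumes IOB-style tags of the kind described for the prediction module and uses "-" as the missing-value marker; both choices, along with the function names and the sample sentence, are assumptions made for illustration.

```python
from typing import List, Tuple


def token_offsets(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Locate each token in the text, scanning from left to right."""
    offsets, cursor = [], 0
    for token in tokens:
        start = text.index(token, cursor)
        offsets.append((start, start + len(token)))
        cursor = start + len(token)
    return offsets


def spans_to_tags(text: str, tokens: List[str],
                  spans: List[Tuple[int, int, str]], missing: str = "-") -> List[str]:
    """Convert (start, end, label) character spans into one tag per token."""
    offsets = token_offsets(text, tokens)
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        # Tokens that overlap the annotated character span.
        covered = [i for i, (s, e) in enumerate(offsets) if s < end and e > start]
        if not covered:
            continue
        # The span is usable only if it starts and ends on token boundaries.
        aligned = offsets[covered[0]][0] == start and offsets[covered[-1]][1] == end
        for position, i in enumerate(covered):
            if aligned:
                tags[i] = ("B-" if position == 0 else "I-") + label
            else:
                tags[i] = missing  # treated as a missing value during training
    return tags


text = "Patients with breast cancer received tamoxifen."
tokens = ["Patients", "with", "breast", "cancer", "received", "tamoxifen", "."]
# The second span deliberately stops one character short of the token boundary.
spans = [(14, 27, "Disease"), (37, 45, "Chemical")]
tags = spans_to_tags(text, tokens, spans)
```

Marking misaligned tokens as missing, rather than as "O", avoids teaching the model that a real entity is a non-entity just because the tokenizer split it differently.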

5. Prediction:

The articles from PubMed are downloaded in XML format. The XML parser parses the XML document into usable parts. The abstract and title are passed to the article processor for pre-processing before being passed to the machine learning module. In the machine learning module, the trained model generated during the training phase is used to identify a label for each token in the new abstracts. Each label includes the entity annotation and an IOB tag. In the IOB tag, “I” stands for “In”, meaning that the token is either an inner word or the last word of an entity; “O” stands for “Out”, meaning that the token is not part of an entity; and “B” stands for “Begin”, meaning that the token is the first word of an entity. Since the tokens in the output are not shuffled and occur in their original sequence, an entity name that spans more than one word is recovered in full by using the I (In), O (Out) and B (Begin) tags attached to each token label. The structured data available from PubMed is moved directly to the graph-building phase of the system.
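
The way multi-word entity names are recovered from the I, O and B tags can be sketched as below. The tag format ("B-" or "I-" followed by the entity class) and the function name `decode_iob` are assumptions for this example; joining tokens with single spaces is also a simplification.

```python
from typing import List, Tuple


def decode_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group consecutive B-/I- tagged tokens into (entity text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    words: List[str] = []
    label = None
    # A trailing "O" sentinel closes any entity that ends the sequence.
    for token, tag in list(zip(tokens, tags)) + [("", "O")]:
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if tag == "O" or starts_new:
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
        if tag != "O":
            words.append(token)
            label = tag[2:]
    return entities


tokens = ["Patients", "with", "breast", "cancer", "received", "tamoxifen", "."]
tags = ["O", "O", "B-Disease", "I-Disease", "O", "B-Chemical", "O"]
entities = decode_iob(tokens, tags)
```

An "I" tag that does not continue an open entity of the same class is treated here as the start of a new entity, which is one of several reasonable ways to handle inconsistent predictions.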

6. Graph Building:

This phase creates a graph framework from the identified entities and other metadata from the publication. The graph schema design enables the user to find relations between different clusters of nodes easily. Such relations may not be evident in a relational database. The graph database used is Neo4j, an open-source graph database implemented in Java. Neo4j provides an official driver for connecting from Python and creating nodes and relationships. The graph-builder module is built to allow easy querying for obscure patterns in large amounts of data. This is possible because the database follows a property graph model, in which nodes are connected via relationships that are stored efficiently and can hold any number of attributes. It can be used by researchers and corporations as a platform for data mining. The input to the graph-builder phase is an array of JSON objects. Each object contains the metadata for one abstract along with the entities identified by machine learning in the previous phase. The metadata includes the article title, identification number, author information and affiliations, journal information, keyword information, date of publication and so on. A time tree is featured in the schema to represent time in the graph as a tree of events (events in this case are the dates on which journals or papers are published). The tree representation of timelines makes time-related queries much more efficient. As a result, a system that researchers and pharmaceutical companies can use to identify research trends is developed.
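
One possible shape for the time tree and the article-to-entity relationships is sketched below as parameterised Cypher statements generated from a metadata record. The node labels (`Year`, `Month`, `Day`, `Article`, `Entity`), the relationship types, the property names and the function `build_statements` are all hypothetical; the schema actually used in the system is not specified at this level of detail.

```python
from datetime import date
from typing import Dict, List, Tuple

# Year -> Month -> Day time tree; MERGE creates each level only if it is absent.
ARTICLE_QUERY = (
    "MERGE (y:Year {value: $year}) "
    "MERGE (y)-[:HAS_MONTH]->(m:Month {value: $month}) "
    "MERGE (m)-[:HAS_DAY]->(d:Day {value: $day}) "
    "MERGE (a:Article {pmid: $pmid}) SET a.title = $title "
    "MERGE (a)-[:PUBLISHED_ON]->(d)"
)

# Link an article to one recognised entity.
ENTITY_QUERY = (
    "MATCH (a:Article {pmid: $pmid}) "
    "MERGE (e:Entity {name: $name, type: $type}) "
    "MERGE (a)-[:MENTIONS]->(e)"
)


def build_statements(article: Dict) -> List[Tuple[str, Dict]]:
    """Turn one article record into (Cypher query, parameters) pairs."""
    published: date = article["published"]
    statements = [(ARTICLE_QUERY, {
        "year": published.year,
        "month": published.month,
        "day": published.day,
        "pmid": article["pmid"],
        "title": article["title"],
    })]
    for name, entity_type in article["entities"]:
        statements.append((ENTITY_QUERY, {
            "pmid": article["pmid"],
            "name": name,
            "type": entity_type,
        }))
    return statements


# Invented record, used only to show the output shape.
record = {
    "pmid": "12345",
    "title": "Aspirin and asthma",
    "published": date(2017, 4, 13),
    "entities": [("aspirin", "Chemical"), ("asthma", "Disease")],
}
statements = build_statements(record)
```

Each pair could then be passed to a session of the official Neo4j Python driver, for example with `session.run(query, parameters)`. With a tree of this kind, a query such as "all articles mentioning a given chemical in a given month" only has to traverse from one `Month` node rather than compare date properties on every article.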

RESULTS AND DISCUSSION:

Figure 2 shows a PubMed article in XML format. The XML parser converts this XML document into usable parts. The abstract and title are pre-processed by the article processor, and the machine learning module performs Named Entity Recognition to find the disease and chemical classes in the article. The identified entities, along with the other metadata, are inserted into a graph database.
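
The XML parsing step can be sketched with Python's ElementTree API. The element names below (`PubmedArticle`, `MedlineCitation`, `PMID`, `ArticleTitle`, `AbstractText`, `Author`) follow the general layout of PubMed XML as understood here and should be checked against real Entrez output; the sample document is invented, and `parse_pubmed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Invented, heavily trimmed sample in the style of a PubMed record.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>Aspirin and asthma</ArticleTitle>
        <Abstract>
          <AbstractText>Aspirin can trigger asthma in some patients.</AbstractText>
        </Abstract>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_pubmed(xml_text: str) -> List[Dict]:
    """Extract title, abstract and basic metadata from each article element."""
    records = []
    root = ET.fromstring(xml_text)
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        # An abstract may be split into several AbstractText sections.
        abstract_parts = [
            "".join(node.itertext()).strip() for node in citation.iter("AbstractText")
        ]
        authors = [
            " ".join(filter(None, [a.findtext("ForeName"), a.findtext("LastName")]))
            for a in citation.iter("Author")
        ]
        records.append({
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("Article/ArticleTitle", default=""),
            "journal": citation.findtext("Article/Journal/Title", default=""),
            "abstract": " ".join(abstract_parts),
            "authors": authors,
        })
    return records


records = parse_pubmed(SAMPLE_XML)
```

The resulting dictionaries carry the title and abstract for the article processor and the remaining fields as metadata for the graph database.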

Figure 3 shows the chemical entities extracted from a given biomedical abstract. Figure 4 shows the disease entities identified in a given biomedical abstract. The proposed Named Entity Recognizer attained an accuracy and precision of approximately 96.5% and 92.3% respectively on the disease corpus. On the CHEMDNER corpus, the system attained an accuracy and precision of approximately 94.9% and 93.0% respectively. Table 1 shows the performance evaluation of the proposed Named Entity Recognizer. These entities, along with the other metadata related to the publication, are inserted into the graph database. Figure 5 shows one cluster of nodes in the graph database.

Table 1. Performance Evaluation of the proposed Named Entity Recognizer

Metric	Chemical Corpus	Disease Corpus
True Positives	19,661	1,368
False Positives	1,472	113
False Negatives	12,406	439
True Negatives	640,250	21,021
Accuracy (%)	94.9403	96.5938
Precision (%)	93.0346	92.3700
Recall (%)	61.3123	75.7056
F-measure (%)	73.9135	83.2117
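
The precision, recall and F-measure rows of Table 1 follow from the confusion counts under the standard definitions, as the sketch below shows. The function name `evaluate` is hypothetical, and the accuracy formula used here, (TP + TN) / (TP + FP + FN + TN), is the standard definition and is an assumption about how the tabulated accuracy was obtained.

```python
from typing import Dict


def evaluate(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute standard classification metrics, as percentages, from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "accuracy": 100 * accuracy,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f_measure": 100 * f_measure,
    }


# Counts taken from Table 1.
chemical = evaluate(tp=19661, fp=1472, fn=12406, tn=640250)
disease = evaluate(tp=1368, fp=113, fn=439, tn=21021)
```

Under these definitions the counts reproduce the tabulated precision, recall and F-measure. Accuracy computed the same way comes out at roughly 97.94% for the chemical corpus and 97.59% for the disease corpus, which differs from the tabulated accuracy values, so that row may have been computed under a different definition or may contain a transcription error.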

Figure 5. One cluster of nodes in the graph database

CONCLUSION:

This system serves as a platform for scientists, researchers and companies to perform analysis on biomedical publications. It is a scalable graph framework of structured data, built by processing unlabelled data from openly published research literature, to be used as a platform for data mining. Complex data patterns can be retrieved robustly and efficiently. The Named Entity Recognizer identifies most of the entities present. However, since millions of entities are in use and new ones are added frequently, no training set can cover them all, and newly added entities will not be identified. The names of biomedical entities are considerably longer on average and often span multiple words. Detecting the start and end points of entity names is therefore not easy, and sometimes one entity is split into two different entities or its name is truncated when identified. As this kind of system can promote better research in the biomedical domain, further effort is needed to improve its accuracy and efficiency.

REFERENCES:

1. Wilbur, J.; L. Smith; and T. Tanabe. BioCreative 2. Gene Mention Task. Proceedings of the Second BioCreative Challenge Workshop. 2007; 7- 16.

2. Leser, U.; and J. Hakenberg. What makes a gene name? Named entity recognition in the biomedical literature. Briefings in Bioinformatics. 6; 2005; 357-369.

3. McCallum, A. Efficiently Inducing Features of Conditional Random Fields. Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI-03), San Francisco, California. 2005; 403-441.

4. Yeh, A.; A. Morgan; M. Colosimo; and L. Hirschman. BioCreAtIvE Task 1A: gene mention finding evaluation. BMC Bioinformatics. 6 (1); 2003; S2.

5. Dominic Balasuriya, Nicky Ringland, Joel Nothman, Tara Murphy, James R. Curran. Named Entity Recognition in Wikipedia. Proceedings of the 2009 Workshop on The People's Web Meets NLP: Collaboratively Constructed Semantic Resources, ACM. 2009; 10 - 18.

6. Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating Non-Local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACM. 2005; 363-370.

7. J.-D. Kim, T. Ohta, Y. Tateisi and J. Tsujii. GENIA corpus—a semantically annotated corpus for bio-text mining, BMC bioinformatics, 19(1); 2003; i180–i182P.

8. M. Krallinger, O. Rabal, F. Leitner, M. Vazquez, D. Salgado, Z. Lu, R. Leaman, Y. Lu, D. Ji, D. M. Lowe et al. The CHEMDNER corpus of chemicals and drugs and its annotation principles. Journal of Cheminformatics. 7(1); 2015; s2.

9. Martin Krallinger, Florian Leitner, Obdulia Rabal, Miguel Vazquez, Julen Oyarzabal, Alfonso Valencia. CHEMDNER: The drugs and chemical names extraction challenge, J. Cheminformatics, 7(1); 2015; s11.

10. R. I. Doğan, R. Leaman, and Z. Lu. NCBI disease corpus: a resource for disease name recognition and concept normalization. Journal of Biomedical Informatics. 47; 2014; 1–10.

11. Shaojun Zhao. Named Entity Recognition in Biomedical Texts using an HMM Model. Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, ACM. 2004; 84 -87.

12. S. Xu, X. An, L. Zhu, Y. Zhang, and H. Zhang. A crf-based system for recognizing chemical entity mentions (cems) in biomedical literature. Journal of Cheminformatics. 7(1); 2015; s11.

13. Corbett and A. Copestake. Cascaded classifiers for confidence-based chemical named entity recognition. BMC bioinformatics. 9 (11); 2008; s4.

14. Ioannis Korkontzelos, Dimitrios Piliouras, Andrew W. Dowsey, Sophia Ananiadou. Boosting drug named entity recognition using an aggregate classifier. Artificial Intelligence in Medicine. 65(2); 2015; 145–153.

15. H. Wang, T. Zhao, H. Tan, and S. Zhang. Biomedical named entity recognition based on classifiers ensemble. IJCSA. 5(2); 2008; 1–11.

Received on 13.04.2017 Modified on 17.05.2017

Research J. Pharm. and Tech. 2017; 10(6): 1911-1918.

DOI: 10.5958/0974-360X.2017.00335.3